SVM2Motif—Reconstructing Overlapping DNA Sequence Motifs by Mimicking an SVM Predictor
Abstract
Identifying discriminative motifs underlying the functionality and evolution of organisms is a major challenge in computational biology. Machine learning approaches such as support vector machines (SVMs) achieve state-of-the-art performance in genomic discrimination tasks, but, because of their black-box character, the motifs underlying their decision functions are largely unknown. As a remedy, positional oligomer importance matrices (POIMs) allow us to visualize the significance of position-specific subsequences. Although POIMs are a major step towards explaining trained SVM models, their size grows exponentially with the length of the motif, which makes manual inspection feasible only for comparatively small motif sizes, typically k ≤ 5. In this work, we extend the work on positional oligomer importance matrices by presenting a new machine-learning methodology, entitled motifPOIM, to extract the truly relevant motifs, regardless of their length and complexity, underlying the predictions of a trained SVM model. Our framework treats the motifs as free parameters in a probabilistic model, and fitting them can be phrased as a non-convex optimization problem. The exponential dependence of the POIM size on the oligomer length poses a major numerical challenge, which we address with an efficient optimization framework that allows us to find possibly overlapping motifs consisting of up to hundreds of nucleotides. We demonstrate the efficacy of our approach on a synthetic data set as well as a real-world human splice site data set.
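To make the exponential growth concrete, the following minimal sketch counts the cells of a POIM for k-mers, assuming one cell per (k-mer, start position) pair over the four-letter DNA alphabet. The function name `poim_entries` and the sequence length of 200 are illustrative assumptions, not values taken from the paper.

```python
def poim_entries(k: int, seq_len: int, alphabet_size: int = 4) -> int:
    """Count the cells of a degree-k POIM (hypothetical helper).

    Assumes one cell per (k-mer, start position) pair: alphabet_size**k
    possible k-mers times the seq_len - k + 1 positions where a k-mer fits.
    """
    if k < 1 or k > seq_len:
        raise ValueError("k must satisfy 1 <= k <= seq_len")
    n_kmers = alphabet_size ** k       # grows exponentially in k
    n_positions = seq_len - k + 1      # shrinks only linearly in k
    return n_kmers * n_positions


if __name__ == "__main__":
    # seq_len=200 is an arbitrary illustrative length, not from the paper.
    for k in (1, 3, 5, 8, 12):
        print(f"k={k:2d}  cells={poim_entries(k, seq_len=200):,}")
```

Under these assumptions the matrix gains a factor of four in rows for every additional nucleotide, which is consistent with the abstract's remark that manual inspection is practical only up to roughly k ≤ 5.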
Similar resources
SVM2Motif—Reconstructing Overlapping Sequence Motifs by Mimicking an SVM Predictor
Major technological advances in sequencing techniques within the past decade have facilitated a deeper understanding of the mechanisms underlying the functionality and evolution of molecular processes. Given the sheer size of many genomes, however, this comes at the expense of an enormous amount of data that demands automatic and computationally efficient methods in genomic discriminat...
Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins.
Proteins that interact with DNA are involved in a number of fundamental biological activities such as DNA replication, transcription, and repair. A reliable identification of DNA-binding sites in DNA-binding proteins is important for functional annotation, site-directed mutagenesis, and modeling protein-DNA interactions. We apply Support Vector Machine (SVM), a supervised pattern recognition me...
Bacillus subtilis CodY operators contain overlapping CodY binding sites.
CodY is a global transcriptional regulator that is activated by branched-chain amino acids. A palindromic 15-bp sequence motif, AATTTTCNGAAAATT, is associated with CodY DNA binding. A gel mobility shift assay was used to examine the effect of pH on the binding of Bacillus subtilis CodY to the hutPp and ureAp(3) promoters. CodY at pH 6.0 has higher affinity for DNA, more enhanced activation by i...
A Two-Stage Evolutionary Approach for Effective Classification of Hypersensitive DNA Sequences
Hypersensitive (HS) sites in genomic sequences are reliable markers of DNA regulatory regions that control gene expression. Annotation of regulatory regions is important in understanding phenotypical differences among cells and diseases linked to pathologies in protein expression. Several computational techniques are devoted to mapping out regulatory regions in DNA by initially identifying HS s...
Recognition and Classification of Histones Using Support Vector Machine
Histones are DNA-binding proteins found in the chromatin of all eukaryotic cells. They are highly conserved and can be grouped into five major classes: H1/H5, H2A, H2B, H3, and H4. Two copies of H2A, H2B, H3, and H4 bind to about 160 base pairs of DNA forming the core of the nucleosome (the repeating structure of chromatin) and H1/H5 bind to its DNA linker sequence. Overall, histones have a hig...